Reading the hard stuff: forms
By David Justin Ross, CEO and President
RAF Technology, Inc.
More than 8 billion forms are processed annually in the US, primarily by expensive human labor. Government Imaging Magazine reports that fully-loaded data entry labor costs about $23/hour in the US and between $4 and $5/hour off-shore. In forms processing, about 80% of ongoing costs are labor for front-end data capture.
If you process a lot of forms, you've probably tried to reduce key entry costs with OCR. After all, any keystroke correctly captured by OCR doesn't have to be typed. And if you've tried OCR, you've learned that OCR accuracy is the throttle on all other costs--human editors, hardware, and the time to convert data. With inaccurate OCR, you just trade key-entry personnel for OCR cleanup people. A perfect OCR would only miss really bad characters, flag what it missed, and not flag anything else. A wrong, flagged character is bad, because you have to correct it. But right or wrong, all flags are bad, because you still have to look at them, and that costs money too. Since forms data must be correct (you only get one chance to get the patient's ID number right, for example), every flag must be looked at by an editor; hence the huge cleanup cost for most forms processing applications.
So how do you choose the OCR system with the highest accuracy on your forms? Keeping in mind that we want accurate data with minimal human labor, let's look at the different kinds of errors.
A correct character that is flagged. Every flag on a correct character must be looked at by a human editor and is therefore a waste of money. Unnecessary flags come about because the OCR can't use information from the document to confirm the character the way a human typically resolves uncertain characters. If OCR could also use this information, flag rates would go down.
A wrong character that is flagged. If the OCR gets a character wrong, you want it to tell you so you can clean it up. Still, you'd rather it didn't get it wrong in the first place. There's often enough information in the document to tell what the character should have been. If the OCR used that same information to get the character right and not flag it, you'd save the cost of correcting it.
A wrong character that is unflagged (substitution error). This is the worst kind of error, because you can't correct it. It's in the data forever. Most of the time, these errors result in data that doesn't make sense--a ZIP code that doesn't match a state name, a provider code that doesn't match the company name. If the OCR engine knew it had produced nonsensical results, it could flag the bad character or field(s). Then you could find and correct it.
The original data is wrong. An OCR designed for forms would say, "I don't care if it's a perfectly good 'A', it doesn't belong in a Social Security number. Flag it." A bad form caught this way costs a lot less to process than one that gets buried in the system.
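The last two kinds of error are caught by exactly the checks described above: a format rule for the Social Security number, and a cross-field rule tying the ZIP code to the state. A minimal Python sketch of that idea follows; the field names and the ZIP-prefix table are invented for illustration, not drawn from any real product:

```python
import re

# Hypothetical lookup: which three-digit ZIP prefixes belong to which
# state (invented values, for illustration only).
ZIP_PREFIX_TO_STATE = {"980": "WA", "981": "WA", "900": "CA", "100": "NY"}

def flag_field_errors(fields):
    """Flag values that cannot be right, even when every character
    was confidently recognized."""
    flags = []
    # A Social Security number is digits only: a perfectly good 'A'
    # still does not belong in it.
    if not re.fullmatch(r"\d{3}-\d{2}-\d{4}", fields["ssn"]):
        flags.append("ssn: not a valid Social Security number format")
    # Cross-field rule: the ZIP code must agree with the state name.
    if ZIP_PREFIX_TO_STATE.get(fields["zip"][:3]) != fields["state"]:
        flags.append("zip/state: ZIP code does not match state")
    return flags
```

A form with "53A-22-1234" in the SSN field gets flagged up front, whether the 'A' came from the OCR or from the person who filled in the form.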
So why do most OCR packages make these avoidable and costly mistakes? It's simple: they don't know your forms as well as you do. Off-the-shelf OCR packages were designed to read free-text documents, not forms. Their knowledge of English context doesn't do you any good if you are reading names, addresses, telephone, and provider numbers. Let's look at some problems from real-world forms and see what a smart OCR system--one that knew your documents--can do with them.
Take the character fragment shown in Figure A below, which was pulled off a stock certificate:
Looking at it by itself, none of us can tell what it is. Maybe a 'U'. In context, however, it becomes obvious (Figure B).
A smart OCR engine would know that "SEVENTY-TWO," not "SEVENTY-TWU," can appear on a stock certificate. It would get the word right and not flag it.
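The lexicon-matching step can be sketched in a few lines of Python. The word list here is an invented, abbreviated stand-in for the full set of words allowed on a stock certificate, and the matching function is a generic fuzzy match, not RAF's actual algorithm:

```python
from difflib import get_close_matches

# Invented, abbreviated lexicon of words allowed in this field;
# a real system would carry the complete list.
ALLOWED_WORDS = ["SEVENTY", "SEVENTY-TWO", "HUNDRED", "SHARES"]

def resolve(raw_ocr_word):
    """Snap a raw OCR result to the closest allowed word, or return
    None when nothing in the lexicon is plausibly close."""
    match = get_close_matches(raw_ocr_word, ALLOWED_WORDS, n=1, cutoff=0.6)
    return match[0] if match else None
```

Here the raw classifier output "SEVENTY-TWU" resolves to "SEVENTY-TWO" because it is far closer to that lexicon entry than to any other, so the word can be corrected without raising a flag.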
What about a more complicated case? Common form types, like health insurance forms or a phone bill (Figure C), have a well-defined context, but it's a lot different from what you find in free-text documents.
There's a lot of information here, and it's organized both horizontally and vertically. Column 1 is the call number, running sequentially. Column 2 is the date, also in order. Column 3 is the time of the call, in order within a date. Column 4 always says "TO." Column 5 has the (sometimes abbreviated) name of the city that was called, and column 6 the state. The number called is in column 7, the duration in column 8, the rate code in column 9, and the cost in column 10. For each section, the costs add up to a SUBTOTAL, and all the SUBTOTALs add up to a TOTAL that may be on another page.
A smart OCR would know all the rules of the document. It wouldn't let columns 1, 2, or 3 have results that were out of sequence, would know 4 always says "TO," would use a list of allowed city names and state names, would know the allowed rate codes, and would know how to add up the costs to get totals and subtotals. In addition, it would know what cities are in what state, what area codes and exchanges go with those cities, and if it's really smart, it might even know what a phone call to a certain destination should cost.
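A few of those rules can be written down mechanically. The sketch below checks one section of the bill; the row keys and the particular rules are assumptions taken from the description above, not a real FormLib+ interface:

```python
def validate_section(rows, printed_subtotal_cents):
    """Flag rows that break the layout rules of the phone bill.

    Each row is a dict; keys and rules are assumptions based on the
    column descriptions in the article.
    """
    flags = []
    for i, row in enumerate(rows):
        if row["call_no"] != i + 1:                      # column 1: sequential
            flags.append((i, "call number out of sequence"))
        if i > 0 and row["date"] < rows[i - 1]["date"]:  # column 2: in order
            flags.append((i, "date out of order"))
        if row["to_literal"] != "TO":                    # column 4: always "TO"
            flags.append((i, 'column 4 must read "TO"'))
    # The costs in column 10 must add up to the printed SUBTOTAL.
    # Working in cents avoids floating-point surprises.
    if sum(row["cost_cents"] for row in rows) != printed_subtotal_cents:
        flags.append(("section", "costs do not add up to SUBTOTAL"))
    return flags
```

A recognizer that applies such rules during recognition can use a failed check to reconsider an uncertain character instead of merely flagging the field afterward.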
No human editor could memorize these rules to resolve all uncertain fields, but a computer can. Simply put, will you get a better answer if you guess what's on the page and try to correct it later or know what can occur and do recognition while validating against those rules?
RAF has developed the first recognition engine that fully exploits the type of context and rules information described above to truly automate forms data capture. On real forms, like the phone bills or health forms, RAF's FormLib+ uses this information to output more accurate answers and far fewer flags. In the real example of the phone bills, off-the-shelf OCR produced 30-50 flags per page. Using RAF's FormLib+, the number of flags was cut to 2, without substitutions. For the project's 50,000 pages per day, that's roughly 2 million flags per day you don't have to look at. If your forms have significant context, we can do the same for you.
RAF Technology, Inc. (Redmond, WA; 206-867-0700) provides custom OCR services for customers with large-volume, high-accuracy OCR tasks. RAF's Data Capture solutions are developed specifically for the customer's platform, database format, and interface requirements. Our customizable off-the-shelf products, FormLib and FormLib+, are the first OCR engines designed at the architecture and algorithm level for forms processing, not free-text recognition.
FormLib+ will be available as a compatible module to Cornerstone's InputAccel at AIIM 1996.
David Justin Ross, RAF's CEO and president, has over fifteen years of experience in developing OCR, pattern recognition, and information management solutions. Ross co-founded Calera Recognition Systems (now part of Caere), serving for seven years as vice president of engineering and as a director.
Why other "solutions" don't solve the OCR forms problem
Why isn't voting multiple recognition engines enough?
Some companies run documents on several different OCR engines and then vote the results. They figure that if one engine gets a character wrong, another might get it right. There are several problems with this approach. First, standard OCR engines were not designed to be independent of each other. Too often, the engines all make the same mistake. Second, most characters are correctly recognized, and those extra engines just slow down your processing immensely. Third, it costs a lot in software licensing fees and extra hardware to run all those extra engines. Finally, and most important, each engine still does OCR "blind," ignoring the structure and context of your forms data.
Why aren't forms processing packages with a post-processing dictionary as effective as using context inside recognition?
Companies that market forms processing packages buy standard OCR, but can't get inside the engines. The OCR engine tries to reconcile your forms with English rules, and flags what doesn't fit. The post-processing validation software then compares the OCR output to your rules and flags items that don't conform. Unfortunately, these corrections are done blind. The validator doesn't have the image. It only knows what the OCR thought was on the image. The result is lots of unnecessary flags that add to your cleanup costs.
How is Line Avoidance different from line or forms removal?
A lot of errors in forms recognition come from characters that touch or overlap lines. Forms processing packages try to remove lines, since lines confuse most OCR engines. This approach fails because it is impossible to decide which pixels belong to a character without knowing what the character is.
RAF's Line Avoidance is a critically different approach. Once the form template has been identified, LA knows where the lines are, but doesn't remove them. It uses the information while doing recognition. Since there is no way of knowing what character parts are hidden beneath the line, LA doesn't use that part of the character to make its decision. Consider the case where a line obscures the bottom of an 'E' or 'F.' Standard line removal removes the line, turning all 'E's into 'F's. "Intelligent" line removal or "line reconstruction" leaves the line, turning all 'F's into 'E's. LA knows that the character might be an 'E' or an 'F' and uses context to decide the correct answer. Only Line Avoidance gets it right.
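The E-versus-F case can be made concrete with a toy sketch: compare the image against each template only on the rows not covered by the known form line, and let any tie fall through to context. The 5x5 bitmaps and the pixel scoring below are purely illustrative of the masking idea, not RAF's implementation:

```python
# Toy 5x5 bitmaps for 'E' and 'F'; they differ only in the bottom row.
E = ["#####",
     "#    ",
     "#### ",
     "#    ",
     "#####"]
F = ["#####",
     "#    ",
     "#### ",
     "#    ",
     "#    "]

def score(image, template, masked_rows):
    """Count matching pixels, ignoring rows hidden by a form line."""
    return sum(image[r][c] == template[r][c]
               for r in range(5) if r not in masked_rows
               for c in range(5))

def classify(image, masked_rows):
    """Return every template that ties for the best score on the
    visible rows; a tie means context, not pixels, must decide."""
    candidates = {"E": E, "F": F}
    best = max(score(image, t, masked_rows) for t in candidates.values())
    return [name for name, t in candidates.items()
            if score(image, t, masked_rows) == best]
```

With no line present, an 'E' image classifies unambiguously; with the bottom row masked by a line, 'E' and 'F' tie on the visible pixels, which is exactly the point: the obscured stroke cannot decide the answer, so context must.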
IW Special Supplement, March 1996